GDAS009-Bioconductor中的基因组注释

前言

这一部分内容涉及R中使用人类基因且,内含子,外显子,转录本,AnnotationHub,基因组的注释包,GO分析,KEGG分析等,笔记末尾的参考文献是原文。

基础注释资源与发现

在这一部分里,我们将回顾Bioconductor中用于处理和注释基因组序列的一些工具。我们将研究参考基因组序列,转录本和基因,并以基因通路(gene pathway)作为结束。我们学习这一部分的最终目标就是使用注释信息来帮助我们对基因组实验进行可靠的解释。Bioconductor的基本目标就是更加方便地有关基因组结构和功能的信息统计统计分析程序。

注释概念的层次结构

Bioconductor包括许多不同类型的基因组注释。我们可以在层次结构中来理解这些注释资源。

  • 最基因的注释就是某个物种的参考基因组序列。它总是按照核苷酸的线性方式排列成染色体(例如参考基因组。
  • 在此之上就是将染色体序列排列到感兴趣的区域中。最感兴趣的区域就是基因,但是注释中也含有其它的信息,例如SNP或CpG位点。基因具有内部结构,即被转录的部分和未被转录的部分。“基因模式”定义了在基因组坐标中的标记和布置这些结构的方式。
    • 在感兴趣的区域(regions of interest)的理念下,我们还定义了面向平台的注释(platform-oriented annotation)。这处类型的注释通常首先是由厂家提供的,但随着研究的进行,对这些平台中最初有歧义信息进行了确认和更新,从而完善了这些注释内容。密歇根大学的brainarray project 说明了affymetrix阵列注释的过程。我们将在本节最后讨论面向平台注释的问题。
  • 在此之是是将区域(通常是基因或基因的产物)组成成具有共同结构或功能特性的组。例如在细胞中共同被发现的,或者是被鉴定为在生物学过程中协同作用的基因组(我的理解就是GO分析,KEGG分析这一类)。

发现可用的参考基因组

Bioconductor已经包含了注释包的合成,将它这一层次结构上的所有元素都带了可编程环境中。参考基因组序列是使用Biostrings和BSgenome包中的工具进行管理的,available.genomes 函数能够列出构建好的人和现在各种模式生物的参考基因组,如下所示:

1
2
3
4
5
6
7
8
9
10
11
library(Biostrings)
ag = available.genomes()
length(ag)
## [1] 87
head(ag)
## [1] "BSgenome.Alyrata.JGI.v1"
## [2] "BSgenome.Amellifera.BeeBase.assembly4"
## [3] "BSgenome.Amellifera.UCSC.apiMel2"
## [4] "BSgenome.Amellifera.UCSC.apiMel2.masked"
## [5] "BSgenome.Athaliana.TAIR.04232008"
## [6] "BSgenome.Athaliana.TAIR.TAIR9"

参考基因组的版本很重要

不同物种的参考基因组是从头构建的,然后随着算法和测序数据的不断改进而进一步完善。对人类而言,基因组研究联盟(Genome Research Consortium)于2009年构建了37号版本,并于2013年构建了38号版本。

一旦参考基因组构建完成,就哦可以很轻松地对某个物种进行信息丰富的基因组序列分析,因为人们可以专注于那引起已知含有等位基因多样性的区域。

The reference build for an organism is created de novo and then refined as algorithms and sequenced data improve. For humans, the Genome Research Consortium signed off on build 37 in 2009, and on build 38 in 2013.

需要注意的是,基因组序列包含有很长的名称,这个名称里包括版本信息。这样命名的方式就是为了避免与不同版本的参考基因组混淆。在LiftOver这节视频里,我们就展示了如何使用UCSC的liftOver工具与rtracklayer包中的接口对接,从而实现不同版本的基因组坐标转化的过程。

为了帮助用户避免混淆从不同参考基因组坐标上收集分析来的数据,我们提供了一个”基因组“标签,这个标签填充了大多关于序列的信息。在随后的部分里,我们会看到一些案例。用于序列比对的软件可以检查被比对上的序列的兼容标签,从而有助于确保有意义的结果。

H. sapiens的参考基因序列

通过安装并添加一个单独的R包就能获取智人(Homo sapiens)的参考序列。这个程序包定义了一个Hsapiens对象,试剂公司对象是染色体序列的来源,但是当对其进行单独显示时,它会提供相关序列数据来源的信息,如下所示:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
library(BSgenome.Hsapiens.UCSC.hg19)
Hsapiens
## Human genome:
## # organism: Homo sapiens (Human)
## # provider: UCSC
## # provider version: hg19
## # release date: Feb. 2009
## # release name: Genome Reference Consortium GRCh37
## # 93 sequences:
## # chr1 chr2 chr3
## # chr4 chr5 chr6
## # chr7 chr8 chr9
## # chr10 chr11 chr12
## # chr13 chr14 chr15
## # ... ... ...
## # chrUn_gl000235 chrUn_gl000236 chrUn_gl000237
## # chrUn_gl000238 chrUn_gl000239 chrUn_gl000240
## # chrUn_gl000241 chrUn_gl000242 chrUn_gl000243
## # chrUn_gl000244 chrUn_gl000245 chrUn_gl000246
## # chrUn_gl000247 chrUn_gl000248 chrUn_gl000249
## # (use 'seqnames()' to see all the sequence names, use the '$' or '[['
## # operator to access a given sequence)
head(genome(Hsapiens)) # see the tag
## chr1 chr2 chr3 chr4 chr5 chr6
## "hg19" "hg19" "hg19" "hg19" "hg19" "hg19"

我们使用 $ 符号来获取17号染色体的序列,如下所示:

1
2
3
Hsapiens$chr17
## 81195210-letter "DNAString" instance
## seq: AAGCTTCTCACCCTGTTCCTGCATAGATAATTGC...GGTGTGGGTGTGGTGTGTGGGTGTGGGTGTGGT

参考序列的转录本和基因

UCSC注释

TxDb包家族和数据对象管理了转录本和基因模式信息。我们可以认为这些信息来源于UCSC基因组浏览器的注释表,如下所示:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb = TxDb.Hsapiens.UCSC.hg19.knownGene # abbreviate
txdb
## TxDb object:
## # Db type: TxDb
## # Supporting package: GenomicFeatures
## # Data source: UCSC
## # Genome: hg19
## # Organism: Homo sapiens
## # Taxonomy ID: 9606
## # UCSC Table: knownGene
## # Resource URL: http://genome.ucsc.edu/
## # Type of Gene ID: Entrez Gene ID
## # Full dataset: yes
## # miRBase build ID: GRCh37
## # transcript_nrow: 82960
## # exon_nrow: 289969
## # cds_nrow: 237533
## # Db created by: GenomicFeatures package from Bioconductor
## # Creation time: 2015-10-07 18:11:28 +0000 (Wed, 07 Oct 2015)
## # GenomicFeatures version at creation time: 1.21.30
## # RSQLite version at creation time: 1.0.0
## # DBSCHEMAVERSION: 1.1

我们使用 genes() 来获取Entrez Gene ID的地址,如下所示:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
ghs = genes(txdb)
ghs
## GRanges object with 23056 ranges and 1 metadata column:
## seqnames ranges strand | gene_id
## <Rle> <IRanges> <Rle> | <character>
## 1 chr19 [ 58858172, 58874214] - | 1
## 10 chr8 [ 18248755, 18258723] + | 10
## 100 chr20 [ 43248163, 43280376] - | 100
## 1000 chr18 [ 25530930, 25757445] - | 1000
## 10000 chr1 [243651535, 244006886] - | 10000
## ... ... ... ... . ...
## 9991 chr9 [114979995, 115095944] - | 9991
## 9992 chr21 [ 35736323, 35743440] + | 9992
## 9993 chr22 [ 19023795, 19109967] - | 9993
## 9994 chr6 [ 90539619, 90584155] + | 9994
## 9997 chr22 [ 50961997, 50964905] - | 9997
## -------
## seqinfo: 93 sequences (1 circular) from hg19 genome

我们也可以使用合适的标识符进行信息过滤。现在我们提取两个不同基因的外显子,这些外显子由其Entrez基因ID标明,如下所示:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
exons(txdb, columns=c("EXONID", "TXNAME", "GENEID"),
filter=list(gene_id=c(100, 101)))
## GRanges object with 39 ranges and 3 metadata columns:
## seqnames ranges strand | EXONID
## <Rle> <IRanges> <Rle> | <integer>
## [1] chr10 [135075920, 135076737] - | 144421
## [2] chr10 [135077192, 135077269] - | 144422
## [3] chr10 [135080856, 135080921] - | 144423
## [4] chr10 [135081433, 135081570] - | 144424
## [5] chr10 [135081433, 135081622] - | 144425
## ... ... ... ... . ...
## [35] chr20 [43254210, 43254325] - | 256371
## [36] chr20 [43255097, 43255240] - | 256372
## [37] chr20 [43257688, 43257810] - | 256373
## [38] chr20 [43264868, 43264929] - | 256374
## [39] chr20 [43280216, 43280376] - | 256375
## TXNAME GENEID
## <CharacterList> <CharacterList>
## [1] uc009ybi.3,uc010qva.2,uc021qbe.1 101
## [2] uc009ybi.3,uc021qbe.1 101
## [3] uc009ybi.3,uc010qva.2,uc021qbe.1 101
## [4] uc009ybi.3 101
## [5] uc010qva.2,uc021qbe.1 101
## ... ... ...
## [35] uc002xmj.3 100
## [36] uc002xmj.3 100
## [37] uc002xmj.3 100
## [38] uc002xmj.3 100
## [39] uc002xmj.3 100
## -------
## seqinfo: 93 sequences (1 circular) from hg19 genome

ENSEMBL注释

Ensembl home主页上写道:Ensembl创建,整合和发布研究基因组的参考数据库和工具。该项目位于 欧洲分子生物学实验室,该实验室的数据库支持其注释资源可以与Bioconductor兼容。

ensembldb 包含有一个简要说明,其内容如下所示:

ensembldb包提供了一些函数,这些函数用于创建和使用以转录本为中心的注释数据库/包。使用注释数据库的Perl API可以从Ensembl 1中直接获取这些数据。TxDb 包的功能和数据类似于GenomicFeatures包,另外,除了从数据库检索所有的基因/转录本模型和注释外,ensembldb包还提供了一个过滤框架,用于检索特定条目的注释,例如位于染色体区域上的某编码基因或某LincRNA转录模式的特定条目。从1.7版本开始,由ensembldb创建的EnsDb数据库还包含蛋白质注释数据库(参考第11节:数据库而已和可用属性/列的概述)。有关蛋白质注释的信息请参考蛋白质的vignette,如下所示:

1
2
3
4
5
6
library(ensembldb)
library(EnsDb.Hsapiens.v75)
names(listTables(EnsDb.Hsapiens.v75))
## [1] "gene" "tx" "tx2exon" "exon"
## [5] "chromosome" "protein" "uniprot" "protein_domain"
## [9] "entrezgene" "metadata"

举例说明如下:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
edb = EnsDb.Hsapiens.v75 # abbreviate
txs <- transcripts(edb, filter = GenenameFilter("ZBTB16"),
columns = c("protein_id", "uniprot_id", "tx_biotype"))
txs
## GRanges object with 20 ranges and 5 metadata columns:
## seqnames ranges strand | protein_id
## <Rle> <IRanges> <Rle> | <character>
## ENST00000335953 11 [113930315, 114121398] + | ENSP00000338157
## ENST00000335953 11 [113930315, 114121398] + | ENSP00000338157
## ENST00000335953 11 [113930315, 114121398] + | ENSP00000338157
## ENST00000335953 11 [113930315, 114121398] + | ENSP00000338157
## ENST00000335953 11 [113930315, 114121398] + | ENSP00000338157
## ... ... ... ... . ...
## ENST00000392996 11 [113931229, 114121374] + | ENSP00000376721
## ENST00000539918 11 [113935134, 114118066] + | ENSP00000445047
## ENST00000545851 11 [114051488, 114118018] + | <NA>
## ENST00000535379 11 [114107929, 114121279] + | <NA>
## ENST00000535509 11 [114117512, 114121198] + | <NA>
## uniprot_id tx_biotype tx_id
## <character> <character> <character>
## ENST00000335953 ZBT16_HUMAN protein_coding ENST00000335953
## ENST00000335953 Q71UL7_HUMAN protein_coding ENST00000335953
## ENST00000335953 Q71UL6_HUMAN protein_coding ENST00000335953
## ENST00000335953 Q71UL5_HUMAN protein_coding ENST00000335953
## ENST00000335953 F5H6C3_HUMAN protein_coding ENST00000335953
## ... ... ... ...
## ENST00000392996 F5H5Y7_HUMAN protein_coding ENST00000392996
## ENST00000539918 <NA> nonsense_mediated_decay ENST00000539918
## ENST00000545851 <NA> processed_transcript ENST00000545851
## ENST00000535379 <NA> processed_transcript ENST00000535379
## ENST00000535509 <NA> retained_intron ENST00000535509
## gene_name
## <character>
## ENST00000335953 ZBTB16
## ENST00000335953 ZBTB16
## ENST00000335953 ZBTB16
## ENST00000335953 ZBTB16
## ENST00000335953 ZBTB16
## ... ...
## ENST00000392996 ZBTB16
## ENST00000539918 ZBTB16
## ENST00000545851 ZBTB16
## ENST00000535379 ZBTB16
## ENST00000535509 ZBTB16
## -------
## seqinfo: 1 sequence from GRCh37 genome

你的数据将会成他人的注释:导入/导出

ENCODE项目很地说明了今天的实验是明天的注释。你应该以同样的方式考虑自己的实验(当然,要使实验成为可靠且持久的注释,它必须解决有关基因组结构或功能的重要问题,并且必须使用适当的,能正确执行的实验流程。需要注意,ENCODE能够非常明确地将实验流程链接到数据)。

例如,我们来看一个雌激素受体结合数据,它是由ENCODE发布的一个narrowPeak 数据。它的碱基是用ascii文本表示的,因此可以很容易地导入为一组文本数据。如果记录的字段有一定的规律性,则可以将文件作为表格导入。

但是,我们不仅是想导入数据,还想将导入的数据作为可计算的对象。我们认识到arrowePeak和bedGraph格式之间的联系后,我们就可以立即将其导入GRanges中。

为了说明这一点,我们在ERBS包中找到narrowPeak原始数据文件的路径,如下所示:

1
2
3
4
5
6
f1 = dir(system.file("extdata",package="ERBS"), full=TRUE)[1]
readLines(f1, 4) # look at a few lines
## [1] "chrX\t1509354\t1512462\t5\t0\t.\t157.92\t310\t32.000000\t1991"
## [2] "chrX\t26801421\t26802448\t6\t0\t.\t147.38\t310\t32.000000\t387"
## [3] "chr19\t11694101\t11695359\t1\t0\t.\t99.71\t311.66\t32.000000\t861"
## [4] "chr19\t4076892\t4079276\t4\t0\t.\t84.74\t310\t32.000000\t1508"

使用import命令非常简单,如下所示:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
library(rtracklayer)
imp = import(f1, format="bedGraph")
imp
## GRanges object with 1873 ranges and 7 metadata columns:
## seqnames ranges strand | score NA.
## <Rle> <IRanges> <Rle> | <numeric> <integer>
## [1] chrX [ 1509355, 1512462] * | 5 0
## [2] chrX [26801422, 26802448] * | 6 0
## [3] chr19 [11694102, 11695359] * | 1 0
## [4] chr19 [ 4076893, 4079276] * | 4 0
## [5] chr3 [53288568, 53290767] * | 9 0
## ... ... ... ... . ... ...
## [1869] chr19 [11201120, 11203985] * | 8701 0
## [1870] chr19 [ 2234920, 2237370] * | 990 0
## [1871] chr1 [94311336, 94313543] * | 4035 0
## [1872] chr19 [45690614, 45691210] * | 10688 0
## [1873] chr19 [ 6110100, 6111252] * | 2274 0
## NA.1 NA.2 NA.3 NA.4 NA.5
## <logical> <numeric> <numeric> <numeric> <integer>
## [1] <NA> 157.92 310 32 1991
## [2] <NA> 147.38 310 32 387
## [3] <NA> 99.71 311.66 32 861
## [4] <NA> 84.74 310 32 1508
## [5] <NA> 78.2 299.505 32 1772
## ... ... ... ... ... ...
## [1869] <NA> 8.65 7.281 0.26576 2496
## [1870] <NA> 8.65 26.258 1.995679 1478
## [1871] <NA> 8.65 12.511 1.47237 1848
## [1872] <NA> 8.65 6.205 0 298
## [1873] <NA> 8.65 17.356 2.013228 496
## -------
## seqinfo: 23 sequences from an unspecified genome; no seqlengths
genome(imp) # genome identifier tag not set, but you should set it
## chrX chr19 chr3 chr17 chr8 chr11 chr16 chr1 chr2 chr6 chr9 chr7
## NA NA NA NA NA NA NA NA NA NA NA NA
## chr5 chr12 chr20 chr21 chr22 chr18 chr10 chr14 chr15 chr4 chr13
## NA NA NA NA NA NA NA NA NA NA NA

我们可以通过一次获取GRanges。元数据列中还有一些其他字段用于指定名称,但是如果我们只对范围感兴趣,除了添加基因组元数据以防止与不兼容的坐标中记录的数据非法组合外,我们就完成了这个任务(这一段不太理解,原文如下):

We obtain a GRanges in one stroke. There are some additional fields in the metadata columns whose names should be specified, but if we are interested only in the ranges, we are done, with the exception of adding the genome metadata to protect against illegitimate combination with data recorded in an incompatible coordinate system.

为了与其他得养家或系统进行交流,我们有两个主要选择。我们可以将GRanges保存为RData对象,轻松地传递给另外一个R用户使用。或者,我们歌词采用其他标准格式进行导出。例如,如果我们仅对间隔地址和绑定的得分感兴趣,则仅保存为“bed”格式就足够了,如下所示:

1
2
3
4
5
6
7
export(imp, "demoex.bed") # implicit format choice
cat(readLines("demoex.bed", n=5), sep="\n")
## chrX 1509354 1512462 . 5 .
## chrX 26801421 26802448 . 6 .
## chr19 11694101 11695359 . 1 .
## chr19 4076892 4079276 . 4 .
## chr3 53288567 53290767 . 9 .

我们已经进行了导入,建模和导入实验数据之间的“往返”,该实验数据可以与其他数据集成在一起,从而增进生物学的理解。

我们需要注意的是,注释在某种程度上是永久正确的,它与在知识边界上的研究进展乏味地隔离开来。我们已经看到了,甚至人类染色体的参考序列也受到了修订。在使用ERBS包时,我们将未知的实验结果视为定义ER结合位点从而进入潜在的生物学解释。不确定性,峰鉴定的可变质量,尚未得到明确估计,但应该是这个样子。

Bioconductor已经尽力致力于这种情况的多个方面。我们维护软件先前版本和注释的存档,以便可以检查或修改过去的工作。我们每年会两次更新中疏注释资源,以确保正在进行的工作以及获得新知识的稳定性。而且,我们已经简化了导入和创建实验数据和注释数据的表示形式。

AnnotationHub

AnnotationHub包用于获取GRanges或其它的合适设计的容器,用于机构设计的容器,如下所示:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
library(AnnotationHub)
##
## Attaching package: 'AnnotationHub'
## The following object is masked from 'package:Biobase':
##
## cache
ah = AnnotationHub()
## snapshotDate(): 2017-10-27
ah
## AnnotationHub with 42282 records
## # snapshotDate(): 2017-10-27
## # $$dataprovider: BroadInstitute, Ensembl, UCSC, ftp://ftp.ncbi.nlm.nih....
## # $$species: Homo sapiens, Mus musculus, Drosophila melanogaster, Bos ta...
## # $$rdataclass: GRanges, BigWigFile, FaFile, TwoBitFile, Rle, ChainFile,...
## # additional mcols(): taxonomyid, genome, description,
## # coordinate_1_based, maintainer, rdatadateadded, preparerclass,
## # tags, rdatapath, sourceurl, sourcetype
## # retrieve records with, e.g., 'object[["AH2"]]'
##
## title
## AH2 | Ailuropoda_melanoleuca.ailMel1.69.dna.toplevel.fa
## AH3 | Ailuropoda_melanoleuca.ailMel1.69.dna_rm.toplevel.fa
## AH4 | Ailuropoda_melanoleuca.ailMel1.69.dna_sm.toplevel.fa
## AH5 | Ailuropoda_melanoleuca.ailMel1.69.ncrna.fa
## AH6 | Ailuropoda_melanoleuca.ailMel1.69.pep.all.fa
## ... ...
## AH58988 | org.Flavobacterium_piscicida.eg.sqlite
## AH58989 | org.Bacteroides_fragilis_YCH46.eg.sqlite
## AH58990 | org.Pseudomonas_mendocina_ymp.eg.sqlite
## AH58991 | org.Salmonella_enterica_subsp._enterica_serovar_Typhimurium...
## AH58992 | org.Acinetobacter_baumannii.eg.sqlite

我们可以通过AnnotationHub获得许多与HepG2细胞系相关的实验数据对象,如下所示:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
query(ah, "HepG2")
## AnnotationHub with 440 records
## # snapshotDate(): 2017-10-27
## # $$dataprovider: UCSC, BroadInstitute, Pazar
## # $$species: Homo sapiens, NA
## # $$rdataclass: GRanges, BigWigFile
## # additional mcols(): taxonomyid, genome, description,
## # coordinate_1_based, maintainer, rdatadateadded, preparerclass,
## # tags, rdatapath, sourceurl, sourcetype
## # retrieve records with, e.g., 'object[["AH22246"]]'
##
## title
## AH22246 | pazar_CEBPA_HEPG2_Schmidt_20120522.csv
## AH22249 | pazar_CTCF_HEPG2_Schmidt_20120522.csv
## AH22273 | pazar_HNF4A_HEPG2_Schmidt_20120522.csv
## AH22309 | pazar_STAG1_HEPG2_Schmidt_20120522.csv
## AH22348 | wgEncodeAffyRnaChipFiltTransfragsHepg2CytosolLongnonpolya.b...
## ... ...
## AH41564 | E118-H4K5ac.imputed.pval.signal.bigwig
## AH41691 | E118-H4K8ac.imputed.pval.signal.bigwig
## AH41818 | E118-H4K91ac.imputed.pval.signal.bigwig
## AH46971 | E118_15_coreMarks_mnemonics.bed.gz
## AH49484 | E118_RRBS_FractionalMethylation.bigwig

query 方法可以使用过滤字符串的向量。要限制对寻址组蛋白H4K5的注释资源的响应,只需要添加该标签,如下所示(To limit response to annotation resources addressing the histone H4K5, simply add that tag):

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
query(ah, c("HepG2", "H4K5"))
## AnnotationHub with 1 record
## # snapshotDate(): 2017-10-27
## # names(): AH41564
## # $$dataprovider: BroadInstitute
## # $$species: Homo sapiens
## # $$rdataclass: BigWigFile
## # $$rdatadateadded: 2015-05-08
## # $$title: E118-H4K5ac.imputed.pval.signal.bigwig
## # $$description: Bigwig File containing -log10(p-value) signal tracks fr...
## # $$taxonomyid: 9606
## # $$genome: hg19
## # $$sourcetype: BigWig
## # $$sourceurl: http://egg2.wustl.edu/roadmap/data/byFileType/signal/cons...
## # $$sourcesize: 226630905
## # $$tags: c("EpigenomeRoadMap", "signal", "consolidatedImputed",
## # "H4K5ac", "E118", "ENCODE2012", "LIV.HEPG2.CNCR", "HepG2
## # Hepatocellular Carcinoma Cell Line")
## # retrieve record with 'object[["AH41564"]]'

The OrgDb基因注释图

那些命名为org.*.ge.db 的包含在基因水平上链接到位置,蛋白产物标识符,KEGG途径和GO term,PMIDs以及其它注释资源的标识符的信息,如下所示:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
library(org.Hs.eg.db)
keytypes(org.Hs.eg.db) # columns() gives same answer
## [1] "ACCNUM" "ALIAS" "ENSEMBL" "ENSEMBLPROT"
## [5] "ENSEMBLTRANS" "ENTREZID" "ENZYME" "EVIDENCE"
## [9] "EVIDENCEALL" "GENENAME" "GO" "GOALL"
## [13] "IPI" "MAP" "OMIM" "ONTOLOGY"
## [17] "ONTOLOGYALL" "PATH" "PFAM" "PMID"
## [21] "PROSITE" "REFSEQ" "SYMBOL" "UCSCKG"
## [25] "UNIGENE" "UNIPROT"
head(select(org.Hs.eg.db, keys="ORMDL3", keytype="SYMBOL",
columns="PMID"))
## 'select()' returned 1:many mapping between keys and columns
## SYMBOL PMID
## 1 ORMDL3 11042152
## 2 ORMDL3 12093374
## 3 ORMDL3 12477932
## 4 ORMDL3 14702039
## 5 ORMDL3 15489334
## 6 ORMDL3 16169070

基因集和通路资源

基因本体论

Gene Ontology (GO)是一种广泛使用的结构化词汇,它组织了基因和基因产物在以下方面的内容:

  • 生物过程
  • 分子功能
  • 细胞组分。

这套词汇本身旨在与所有生物有关。它采用有向无环图的形式,其中term作为节点,使用is-apart-of关系作构成了大多数链接。

将生物体特定基因链接到基因本体中的术语的注释与词汇表本身是分开的,并且涉及不同类型的证据。这些记录都在Bioconductor的注释包中。

我们可以使用GO.db包来快速地访问GO词汇,如下所示:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
library(GO.db)
GO.db # metadata
## GODb object:
## | GOSOURCENAME: Gene Ontology
## | GOSOURCEURL: ftp://ftp.geneontology.org/pub/go/godatabase/archive/latest-lite/
## | GOSOURCEDATE: 2017-Nov01
## | Db type: GODb
## | package: AnnotationDbi
## | DBSCHEMA: GO_DB
## | GOEGSOURCEDATE: 2017-Nov6
## | GOEGSOURCENAME: Entrez Gene
## | GOEGSOURCEURL: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA
## | DBSCHEMAVERSION: 2.1
##
## Please see: help('select') for usage information

使用AnnotationDbi包中的keyscolumnsselect函数也很容易在地id与不同terms之间进行映射,如下所示:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
k5 = keys(GO.db)[1:5]
cgo = columns(GO.db)
select(GO.db, keys=k5, columns=cgo[1:3])
## 'select()' returned 1:1 mapping between keys and columns
## GOID
## 1 GO:0000001
## 2 GO:0000002
## 3 GO:0000003
## 4 GO:0000006
## 5 GO:0000007
## DEFINITION
## 1 The distribution of mitochondria, including the mitochondrial genome, into daughter cells after mitosis or meiosis, mediated by interactions between mitochondria and the cytoskeleton.
## 2 The maintenance of the structure and integrity of the mitochondrial genome; includes replication and segregation of the mitochondrial chromosome.
## 3 The production of new individuals that contain some portion of genetic material inherited from one or more parent organisms.
## 4 Enables the transfer of zinc ions (Zn2+) from one side of a membrane to the other, probably powered by proton motive force. In high-affinity transport the transporter is able to bind the solute even if it is only present at very low concentrations.
## 5 Enables the transfer of a solute or solutes from one side of a membrane to the other according to the reaction: Zn2+ = Zn2+, probably powered by proton motive force. In low-affinity transport the transporter is able to bind the solute only if it is present at very high concentrations.
## ONTOLOGY
## 1 BP
## 2 BP
## 3 BP
## 4 MF
## 5 MF

词汇表的图形结构被编码在SQLite数据库的表中。我们可以使用RSQLite接口对此进行查询,如下所示:

1
2
3
4
5
6
7
con = GO_dbconn()
dbListTables(con)
## [1] "go_bp_offspring" "go_bp_parents" "go_cc_offspring"
## [4] "go_cc_parents" "go_mf_offspring" "go_mf_parents"
## [7] "go_obsolete" "go_ontology" "go_synonym"
## [10] "go_term" "map_counts" "map_metadata"
## [13] "metadata" "sqlite_stat1"

以下查询提示了一些内部标识符:

1
2
3
4
5
6
7
dbGetQuery(con, "select _id, go_id, term from go_term limit 5")
## _id go_id term
## 1 30 GO:0000001 mitochondrion inheritance
## 2 32 GO:0000002 mitochondrial genome maintenance
## 3 33 GO:0000003 reproduction
## 4 37 GO:0042254 ribosome biogenesis
## 5 38 GO:0044183 protein binding involved in protein folding

我们可以将 mitochondrion inheritance term追溯到父项和祖父母项,如下所示:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
dbGetQuery(con, "select * from go_bp_parents where _id=30")
## _id _parent_id relationship_type
## 1 30 26537 is_a
## 2 30 26540 is_a
dbGetQuery(con, "select _id, go_id, term from go_term where _id=26616")
## _id go_id
## 1 26616 GO:0048387
## term
## 1 negative regulation of retinoic acid receptor signaling pathway
dbGetQuery(con, "select * from go_bp_parents where _id=26616")
## _id _parent_id relationship_type
## 1 26616 8389 is_a
## 2 26616 26614 is_a
## 3 26616 26613 negatively_regulates
dbGetQuery(con, "select _id, go_id, term from go_term where _id=5932")
## _id go_id term
## 1 5932 GO:0019237 centromeric DNA binding

将 “mitochondrion inheritance” 视为过程“mitochondrion distribution”和 “organelle inheritance”在概念上的精练是有意义的,这两个term在数据库中被为父项。

可以使用 GO_dbschema()来查看整个数据库模式。

KEGG

自Bioconductor诞生以来,KEGG的注释就能在Bioconductor中人使用了,但KEGG的数据库使用权限已经进行了更改。当我们使用KEGG.db加载后会出现以下信息,如下所示:

1
2
3
4
5
6
7
> library(KEGG.db)
KEGG.db contains mappings based on older data because the original
resource was removed from the the public domain before the most
recent update was produced. This package should now be considered
deprecated and future versions of Bioconductor may not have it
available. Users who want more current data are encouraged to look
at the KEGGREST or reactome.db packages

因此我们可以关注KEGGREST这个包,它需要联网。这是一个非常有用的,基于Entrez标识符的工具。现在我们查询一下BRCA2的信息(它的EntrezID为675),如下所示:

1
2
3
4
5
6
library(KEGGREST)
brca2K = keggGet("hsa:675")
names(brca2K[[1]])
## [1] "ENTRY" "NAME" "DEFINITION" "ORTHOLOGY" "ORGANISM"
## [6] "PATHWAY" "DISEASE" "BRITE" "POSITION" "MOTIF"
## [11] "DBLINKS" "STRUCTURE" "AASEQ" "NTSEQ"

我们也可以通过keggGet函数来获取构成通路模式的基因列表,如下所示:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
brpat = keggGet("path:hsa05212")
names(brpat[[1]])
## [1] "ENTRY" "NAME" "DESCRIPTION" "CLASS" "PATHWAY_MAP"
## [6] "DISEASE" "DRUG" "ORGANISM" "GENE" "COMPOUND"
## [11] "KO_PATHWAY" "REFERENCE"
brpat[[1]]$GENE[seq(1,132,2)] # entrez gene ids
## [1] "3845" "5290" "5293" "5291" "5295" "5296" "8503" "9459"
## [9] "5879" "5880" "5881" "4790" "5970" "207" "208" "10000"
## [17] "1147" "3551" "8517" "572" "598" "842" "369" "673"
## [25] "5894" "5604" "5594" "5595" "5599" "5602" "5601" "5900"
## [33] "5898" "5899" "10928" "998" "7039" "1950" "1956" "2064"
## [41] "2475" "6198" "6199" "3716" "6774" "6772" "7422" "1029"
## [49] "1019" "1021" "595" "5925" "1869" "1870" "1871" "7157"
## [57] "1026" "1647" "4616" "10912" "581" "578" "1643" "51426"
## [65] "7040" "7042"

KEGGREST还有许多值得研究的地方,例如还可以查询BRCA2(人类)关于胰腺癌途径的静态图像,如下所示:

1
2
3
4
library(png)
library(grid)
brpng = keggGet("hsa05212", "image")
grid.raster(brpng)

plot of chunk getp

其它本体

rols包含有与EMBL-EBI连接的接口 Ontology Lookup Service.

1
2
3
4
5
6
7
8
9
10
11
12
library(rols)
oo = Ontologies()
oo
## Object of class 'Ontologies' with 198 entries
## GENEPIO, MP ... SEPIO, SIBO
oo[[1]]
## Ontology: Genomic Epidemiology Ontology (genepio)
## The Genomic Epidemiology Ontology (GenEpiO) covers vocabulary
## necessary to identify, document and research foodborne pathogens
## and associated outbreaks.
## Loaded: 2017-04-10 Updated: 2017-10-20 Version: 2017-04-09
## 4351 terms 137 properties 38 individuals

为了控制查询检索中涉及的网络流量,搜索分为几个阶段,如下所示:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
glis = OlsSearch("glioblastoma")
glis
## Object of class 'OlsSearch':
## query: glioblastoma
## requested: 20 (out of 502)
## response(s): 0
res = olsSearch(glis)
dim(res)
## NULL
resdf = as(res, "data.frame") # get content
resdf[1:4,1:4]
## id
## 1 ncit:class:http://purl.obolibrary.org/obo/NCIT_C3058
## 2 omit:http://purl.obolibrary.org/obo/OMIT_0007102
## 3 ordo:class:http://www.orpha.net/ORDO/Orphanet_360
## 4 hp:class:http://purl.obolibrary.org/obo/HP_0100843
## iri short_form label
## 1 http://purl.obolibrary.org/obo/NCIT_C3058 NCIT_C3058 Glioblastoma
## 2 http://purl.obolibrary.org/obo/OMIT_0007102 OMIT_0007102 Glioblastoma
## 3 http://www.orpha.net/ORDO/Orphanet_360 Orphanet_360 Glioblastoma
## 4 http://purl.obolibrary.org/obo/HP_0100843 HP_0100843 Glioblastoma
resdf[1,5] # full description for one instance
## [[1]]
## [1] "The most malignant astrocytic tumor (WHO grade IV). It is composed of poorly differentiated neoplastic astrocytes and it is characterized by the presence of cellular polymorphism, nuclear atypia, brisk mitotic activity, vascular thrombosis, microvascular proliferation and necrosis. It typically affects adults and is preferentially located in the cerebral hemispheres. It may develop from diffuse astrocytoma WHO grade II or anaplastic astrocytoma (secondary glioblastoma, IDH-mutant), but more frequently, it manifests after a short clinical history de novo, without evidence of a less malignant precursor lesion (primary glioblastoma, IDH- wildtype). (Adapted from WHO)"

ontologyIndex 包支持导入开放生物本体(OBO, Open Biological Ontologies)格式的数据,并含有用于查询和可视化本体系统高效的工具。

通用基因集管理

GSEABase 包有一个用于管理基因集和集合的优秀工具。我们可以从MSigDb中导入胶质母细胞瘤相关的基因集来说明一下,如下所示:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
library(GSEABase)
glioG = getGmt(system.file("gmt/glioSets.gmt", package="ph525x"))
## Warning in readLines(con, ...): incomplete final line found on '/
## Library/Frameworks/R.framework/Versions/3.4/Resources/library/ph525x/gmt/
## glioSets.gmt'
glioG
## GeneSetCollection
## names: BALDWIN_PRKCI_TARGETS_UP, BEIER_GLIOMA_STEM_CELL_DN, ..., ZHENG_GLIOBLASTOMA_PLASTICITY_UP (47 total)
## unique identifiers: ADA, AQP9, ..., ZFP28 (3671 total)
## types in collection:
## geneIdType: NullIdentifier (1 total)
## collectionType: NullCollection (1 total)
head(geneIds(glioG[[1]]))
## [1] "ADA" "AQP9" "ATP2B4" "ATP6V1G1" "CBX6" "CCDC165"

模式生物的统一,自我描述方法

OrganismDb包简化了对注释的访问。还可以针对TxDborg.[Nn].eg.db进行直接查询,如下所示:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
library(Homo.sapiens)
class(Homo.sapiens)
## [1] "OrganismDb"
## attr(,"package")
## [1] "OrganismDbi"
Homo.sapiens
## OrganismDb Object:
## # Includes GODb Object: GO.db
## # With data about: Gene Ontology
## # Includes OrgDb Object: org.Hs.eg.db
## # Gene data about: Homo sapiens
## # Taxonomy Id: 9606
## # Includes TxDb Object: TxDb.Hsapiens.UCSC.hg19.knownGene
## # Transcriptome data about: Homo sapiens
## # Based on genome: hg19
## # The OrgDb gene id ENTREZID is mapped to the TxDb gene id GENEID .
tx = transcripts(Homo.sapiens)
## 'select()' returned 1:1 mapping between keys and columns
keytypes(Homo.sapiens)
## [1] "ACCNUM" "ALIAS" "CDSID" "CDSNAME"
## [5] "DEFINITION" "ENSEMBL" "ENSEMBLPROT" "ENSEMBLTRANS"
## [9] "ENTREZID" "ENZYME" "EVIDENCE" "EVIDENCEALL"
## [13] "EXONID" "EXONNAME" "GENEID" "GENENAME"
## [17] "GO" "GOALL" "GOID" "IPI"
## [21] "MAP" "OMIM" "ONTOLOGY" "ONTOLOGYALL"
## [25] "PATH" "PFAM" "PMID" "PROSITE"
## [29] "REFSEQ" "SYMBOL" "TERM" "TXID"
## [33] "TXNAME" "UCSCKG" "UNIGENE" "UNIPROT"
columns(Homo.sapiens)
## [1] "ACCNUM" "ALIAS" "CDSCHROM" "CDSEND"
## [5] "CDSID" "CDSNAME" "CDSSTART" "CDSSTRAND"
## [9] "DEFINITION" "ENSEMBL" "ENSEMBLPROT" "ENSEMBLTRANS"
## [13] "ENTREZID" "ENZYME" "EVIDENCE" "EVIDENCEALL"
## [17] "EXONCHROM" "EXONEND" "EXONID" "EXONNAME"
## [21] "EXONRANK" "EXONSTART" "EXONSTRAND" "GENEID"
## [25] "GENENAME" "GO" "GOALL" "GOID"
## [29] "IPI" "MAP" "OMIM" "ONTOLOGY"
## [33] "ONTOLOGYALL" "PATH" "PFAM" "PMID"
## [37] "PROSITE" "REFSEQ" "SYMBOL" "TERM"
## [41] "TXCHROM" "TXEND" "TXID" "TXNAME"
## [45] "TXSTART" "TXSTRAND" "TXTYPE" "UCSCKG"
## [49] "UNIGENE" "UNIPROT"

面向平台的注释

通过在NCBI GEO的GPL信息页面 上对信息进行排序,我们就可以看到最常用的寡核苷阵列平台(数据库中有4760个系列)就是Affy Human Genome U133 plus 2.0 array (GPL 570)。我们可以使用hgu133plus2.db对这些数据进行注释,如下所示:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
library(hgu133plus2.db)
##
hgu133plus2.db
## ChipDb object:
## | DBSCHEMAVERSION: 2.1
## | Db type: ChipDb
## | Supporting package: AnnotationDbi
## | DBSCHEMA: HUMANCHIP_DB
## | ORGANISM: Homo sapiens
## | SPECIES: Human
## | MANUFACTURER: Affymetrix
## | CHIPNAME: Human Genome U133 Plus 2.0 Array
## | MANUFACTURERURL: http://www.affymetrix.com/support/technical/byproduct.affx?product=hg-u133-plus
## | EGSOURCEDATE: 2015-Sep27
## | EGSOURCENAME: Entrez Gene
## | EGSOURCEURL: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA
## | CENTRALID: ENTREZID
## | TAXID: 9606
## | GOSOURCENAME: Gene Ontology
## | GOSOURCEURL: ftp://ftp.geneontology.org/pub/go/godatabase/archive/latest-lite/
## | GOSOURCEDATE: 20150919
## | GOEGSOURCEDATE: 2015-Sep27
## | GOEGSOURCENAME: Entrez Gene
## | GOEGSOURCEURL: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA
## | KEGGSOURCENAME: KEGG GENOME
## | KEGGSOURCEURL: ftp://ftp.genome.jp/pub/kegg/genomes
## | KEGGSOURCEDATE: 2011-Mar15
## | GPSOURCENAME: UCSC Genome Bioinformatics (Homo sapiens)
## | GPSOURCEURL: ftp://hgdownload.cse.ucsc.edu/goldenPath/hg19
## | GPSOURCEDATE: 2010-Mar22
## | ENSOURCEDATE: 2015-Jul16
## | ENSOURCENAME: Ensembl
## | ENSOURCEURL: ftp://ftp.ensembl.org/pub/current_fasta
## | UPSOURCENAME: Uniprot
## | UPSOURCEURL: http://www.uniprot.org/
## | UPSOURCEDATE: Thu Oct 1 23:31:58 2015
##
## Please see: help('select') for usage information

这个资源(以及ChipDb类的所有实例)的基本目的是在探针集(probeset)识别符和更高层次的基因组注释之间进行映射。

有关探针的详细信息(探针集的组成部分)已经由那些后缀为probe的文件包提供,如下所示:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
library(hgu133plus2probe)
head(hgu133plus2probe)
## sequence x y Probe.Set.Name
## 1 CACCCAGCTGGTCCTGTGGATGGGA 718 317 1007_s_at
## 2 GCCCCACTGGACAACACTGATTCCT 1105 483 1007_s_at
## 3 TGGACCCCACTGGCTGAGAATCTGG 584 901 1007_s_at
## 4 AAATGTTTCCTTGTGCCTGCTCCTG 192 205 1007_s_at
## 5 TCCTTGTGCCTGCTCCTGTACTTGT 844 979 1007_s_at
## 6 TGCCTGCTCCTGTACTTGTCCTCAG 537 971 1007_s_at
## Probe.Interrogation.Position Target.Strandedness
## 1 3330 Antisense
## 2 3443 Antisense
## 3 3512 Antisense
## 4 3563 Antisense
## 5 3570 Antisense
## 6 3576 Antisense
dim(hgu133plus2probe)
## [1] 604258 6

将探针集标识符映射到基因水平的信息可以提示一些有意思的歧视,如下所示:

1
2
3
4
5
6
7
8
9
select(hgu133plus2.db, keytype="PROBEID",
columns=c("SYMBOL", "GENENAME", "PATH", "MAP"), keys="1007_s_at")
## 'select()' returned 1:many mapping between keys and columns
## PROBEID SYMBOL GENENAME PATH
## 1 1007_s_at DDR1 discoidin domain receptor tyrosine kinase 1 <NA>
## 2 1007_s_at MIR4640 microRNA 4640 <NA>
## MAP
## 1 6p21.33
## 2 6p21.33

显然,该探针集合可以用于mRNA和miRNA丰度的定量。作为稳定的检查,我们可以看到,不同的符号映射到了相同的细胞带(最后一句不懂,原文为: As a sanity check we see that the distinct symbols map to the same cytoband)。

总结

我们现在已经拥有了含有从核酸到通路水平的许多数据。通过Bioconductor.org上的View就可以查看现有的一些资源。

参考资料

  1. Genomic annotation in Bioconductor: The general situation